
    On the Development and Evaluation of a Brazilian Portuguese Discourse Parser

    This paper presents the development process and evaluation procedure of DiZer, a discourse parser for Brazilian Portuguese. Based on Rhetorical Structure Theory, DiZer is a symbolic, cue-phrase-based analyzer that uses discourse templates learned from a corpus of scientific texts to identify and build the discourse structure of texts. The evaluation shows satisfactory results for both scientific and news texts, even though DiZer was not designed for the latter, which demonstrates its portability.

    Lexicon-based sentiment analysis for reviews of products in Brazilian Portuguese

    This paper presents results on lexicon-based classification of sentiment polarity in web reviews of products written in Brazilian Portuguese. They represent a first step towards a robust opinion miner for reviews of technology products. The evaluation compares the performance of three sentiment lexicons combined with simple strategies. The paper also discusses the risk of using the ratings provided by the review writers to evaluate the algorithms. The results show that the best combination is the version of the algorithm that also handles negation and intensification and uses the Sentilex sentiment lexicon, with an average F-measure of 0.73. Samsung Eletrônica da Amazônia Ltda
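
The lexicon-based strategy with negation and intensification described in this abstract can be illustrated with a minimal sketch. The toy lexicon, the negator and intensifier lists, the weights, and the three-token look-back window are all assumptions made for illustration; they are not Sentilex entries and not the paper's actual algorithm.

```python
# Toy resources: illustrative assumptions, not taken from Sentilex or the paper.
LEXICON = {"bom": 1, "ótimo": 1, "excelente": 1, "ruim": -1, "péssimo": -1, "lento": -1}
NEGATORS = {"não", "nunca", "jamais"}
INTENSIFIERS = {"muito": 2.0, "super": 2.0, "pouco": 0.5}


def score_text(text, window=3):
    """Sum the polarity of lexicon words, adjusted by nearby modifiers."""
    tokens = text.lower().split()
    score = 0.0
    for i, tok in enumerate(tokens):
        if tok not in LEXICON:
            continue
        value = float(LEXICON[tok])
        # Look back a few tokens for negators and intensifiers.
        for prev in tokens[max(0, i - window):i]:
            if prev in NEGATORS:
                value = -value            # negation flips the polarity
            elif prev in INTENSIFIERS:
                value *= INTENSIFIERS[prev]  # intensification scales it
        score += value
    return score


def polarity(text):
    """Map the numeric score to a polarity label."""
    score = score_text(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(polarity("o celular é muito bom"))
print(polarity("a câmera não é boa mas o som não é ruim"))
```

A whitespace tokenizer is used only to keep the sketch short; real review text would need proper tokenization and handling of inflected forms such as "boa", which the toy lexicon does not cover.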

    NILC_USP: an improved hybrid system for sentiment analysis in Twitter messages

    This paper describes the NILC_USP system that participated in SemEval-2014 Task 9: Sentiment Analysis in Twitter, a re-run of the SemEval-2013 task of the same name. Our system is an improved version of the system that participated in the 2013 task. It adopts a hybrid classification process that uses three classification approaches: rule-based, lexicon-based and machine learning. We suggest a pipeline architecture that extracts the best characteristics from each classifier. In this work, we want to verify how this hybrid approach would improve with better classifiers. The improved system achieved an F-score of 65.39% in the Twitter message-level subtask for the 2013 dataset (+9.08% improvement) and 63.94% for the 2014 dataset. FAPESP, SAMSUN

    A importância dos falsos homógrafos para a correção automática de erros ortográficos em português [The importance of false homographs for the automatic correction of spelling errors in Portuguese]

    This paper reports the analysis of 25,722 pairs of Portuguese words that differ from each other by a single diacritic, called “false homographs”. Such words are relevant for spelling correction because a misspelled word that is missing a diacritic is identical to a correct word, which prevents the misspelling from being identified and corrected. The purpose of the analysis is to identify pairs in which the non-accented form is much less frequent than its accented counterpart and to exclude those non-accented forms from the lexicon used by a Portuguese spell checker. This action is especially justified when one aims to correct User-Generated Content (UGC), a kind of text characterized by missing diacritics, among other features. The result is a list of 2,052 words that meet the requirements of the proposed strategy. Samsung Eletrônica da Amazônia Ltd
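
The filtering idea in this abstract can be sketched as follows. The frequency table, the `ratio` threshold, and the function names are hypothetical; the abstract only says the non-accented form should be relatively less frequent than the accented one, so the factor of 10 is an assumption, as is counting the cedilla as a diacritic.

```python
import unicodedata


def split_marks(word):
    """Return the word without combining marks, and the number of marks removed."""
    decomposed = unicodedata.normalize("NFD", word)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return base, len(decomposed) - len(base)


def false_homograph_pairs(freq):
    """Find (bare, accented) pairs that differ by exactly one diacritic."""
    pairs = []
    for word in freq:
        bare, marks = split_marks(word)
        if marks == 1 and bare in freq:
            pairs.append((bare, word))
    return sorted(pairs)


def words_to_exclude(freq, ratio=10.0):
    """Bare forms whose accented twin is at least `ratio` times more frequent."""
    return [bare for bare, accented in false_homograph_pairs(freq)
            if freq[accented] >= ratio * freq[bare]]


# Hypothetical corpus counts, for illustration only.
freq = {"música": 500, "musica": 3, "está": 900, "esta": 800,
        "número": 400, "numero": 2, "casa": 700}

print(false_homograph_pairs(freq))
print(words_to_exclude(freq))
```

With these toy counts, "esta" stays in the lexicon because it is a frequent word in its own right, while "musica" and "numero" are dropped so that a speller can flag them and suggest the accented forms.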

    Assessing the contribution of shallow and deep knowledge sources for word sense disambiguation

    Corpus-based techniques have proved to be very beneficial in the development of efficient and accurate approaches to word sense disambiguation (WSD), despite the fact that they generally represent relatively shallow knowledge. It has always been thought, however, that WSD could also benefit from deeper knowledge sources. We describe a novel approach to WSD using inductive logic programming to learn theories from first-order logic representations that allows corpus-based evidence to be combined with any kind of background knowledge. This approach has been shown to be effective over several disambiguation tasks using a combination of deep and shallow knowledge sources. It is important to understand the contribution of the various knowledge sources used in such a system. This paper investigates the contribution of nine knowledge sources to the performance of the disambiguation models produced for the SemEval-2007 English lexical sample task. The outcome of this analysis will help future work on WSD concentrate on the most useful knowledge sources.

    Correlations between structure and dynamics in complex networks

    Previous efforts in complex networks research focused mainly on the topological features of such networks, but now also encompass the dynamics. In this Letter we discuss the relationship between structure and dynamics, with an emphasis on identifying whether a topological hub, i.e. a node with high degree or strength, is also a dynamical hub, i.e. a node with high activity. We employ random walk dynamics and establish the necessary conditions for a network to be topologically and dynamically fully correlated, with topological hubs that are also highly active. Zipf's law is then shown to be a reflection of the match between structure and dynamics in a fully correlated network, as well as a consequence of the rich-get-richer evolution inherent in scale-free networks. We also examine a number of real networks for correlations between topology and dynamics and find that many of them are not fully correlated. Comment: 16 pages, 7 figures, 1 table
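
The link between topological and dynamical hubs can be illustrated in the simplest setting: an unbiased random walk on a connected, undirected, unweighted graph, where the long-run probability of finding the walker at a node is proportional to that node's degree. The sketch below checks this standard result numerically on a small hypothetical graph; it is not the Letter's own derivation, and the graph and function names are assumptions.

```python
def stationary_distribution(adj, steps=1000):
    """Iterate the random-walk update until the occupation probabilities settle."""
    nodes = sorted(adj)
    p = {n: 1.0 / len(nodes) for n in nodes}  # start from a uniform distribution
    for _ in range(steps):
        nxt = {n: 0.0 for n in nodes}
        for n in nodes:
            share = p[n] / len(adj[n])  # walker leaves n along a uniformly chosen edge
            for m in adj[n]:
                nxt[m] += share
        p = nxt
    return p


# Hypothetical undirected graph; the triangle a-b-c keeps the walk aperiodic.
adj = {"a": ["b", "c", "d"], "b": ["a", "c"], "c": ["a", "b"], "d": ["a"]}

activity = stationary_distribution(adj)
total_degree = sum(len(neigh) for neigh in adj.values())
for node in sorted(adj):
    # Compare the simulated activity with degree / (sum of degrees).
    print(node, round(activity[node], 6), len(adj[node]) / total_degree)
```

Here node "a" has the highest degree and also the highest activity, so the topological hub coincides with the dynamical hub. Directed or otherwise asymmetric networks need not satisfy this proportionality.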

    A large corpus of product reviews in Portuguese: tackling out-of-vocabulary words

    Web 2.0 has enabled an unprecedented communication boom. With the widespread use of computers and mobile devices, anyone, in practically any language, may post comments on the web. As a result, formal language is not necessarily used. In fact, in these communicative situations, language is marked by the absence of more complex syntactic structures and the presence of internet slang, with missing diacritics, repetitions of vowels, and the use of chat-speak abbreviations, emoticons and colloquial expressions. Such language use poses severe new challenges for Natural Language Processing (NLP) tools and applications, which, so far, have focused on well-written texts. In this work, we report the construction of a large web corpus of product reviews in Brazilian Portuguese and the analysis of its lexical phenomena. This analysis supports the development of a lexical normalization tool that, in future work, should enable the use of standard NLP tools for web opinion mining and summarization. University of São Paulo; Samsung Eletrônica da Amazônia Ltda; FAPESP; CNP